Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control

The “awk” command extracts the first and fourth columns from “SRR030834_tab.txt” and prints them separated by a tab. The output is redirected to a new text file, “SRR030834_seq.txt” (Figure 1.8).
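The column extraction described above can be tried on a small sample. The following is a minimal sketch, not the book's exact command: the two reads, the instrument names in their headers, and the file names "demo_tab.txt" and "demo_seq.txt" are made up for illustration. It assumes each header has three space-separated parts, which is why the sequence is the fourth whitespace-delimited field.

```shell
# Hypothetical two-read sample in the tab-delimited layout produced by "paste - - - -".
# Each header has three space-separated parts, so with awk's default whitespace
# splitting the sequence is field 4.
workdir=$(mktemp -d)
printf '@SRR030834.1 HWI-EAS1:1:1:0:1 length=8\tACGTACGT\t+\tIIIIIIII\n' > "$workdir/demo_tab.txt"
printf '@SRR030834.2 HWI-EAS1:1:1:0:2 length=8\tTTGACCAA\t+\tIIIIIIII\n' >> "$workdir/demo_tab.txt"

# Print the read ID (field 1) and the sequence (field 4), separated by a tab.
awk '{print $1 "\t" $4}' "$workdir/demo_tab.txt" > "$workdir/demo_seq.txt"
cat "$workdir/demo_seq.txt"
```

If the headers in a file contain a different number of space-separated parts, the sequence will not be field 4. Splitting on tabs only (awk -F'\t') makes the sequence field 2 regardless of the header contents.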

Linux commands can be chained to perform multi-step operations. Assume that we want to create a FASTA file from the FASTQ file. First, we extract both the IDs and the sequences into a file as we did above. Next, we remove the “@” symbol, leaving only the IDs, and add “>” at the beginning of each line with no space between the “>” and the ID. Finally, we split the two columns onto separate lines, forming the definition line (defline) of FASTA and the sequence, store them in a file, and delete the temporary files.

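# Join every four FASTQ lines into one tab-delimited line
cat SRR030834.fastq | paste - - - - \
  > SRR030834_tab.tmp
# Keep the ID (field 1) and the sequence (field 4); strip the leading "@"
awk '{print $1 "\t" $4}' SRR030834_tab.tmp \
  | sed 's/^@//' > SRR030834_seq.tmp
# Prefix each line with ">" (GNU sed in-place edit)
sed -i 's/^/>/' SRR030834_seq.tmp
# Put the defline and the sequence on separate lines
awk '{print $1 "\n" $2}' SRR030834_seq.tmp \
  > SRR030834.fasta
# Delete only the temporary files created above
rm SRR030834_tab.tmp SRR030834_seq.tmp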
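The same conversion can also be done in a single pass, without temporary files. The following is a sketch of an alternative, not the book's procedure: the sample reads and the file names "demo.fastq" and "demo.fasta" are hypothetical, and it assumes every FASTQ record occupies exactly four lines (the same assumption that "paste - - - -" makes).

```shell
# Hypothetical two-read FASTQ file used only for this sketch.
workdir=$(mktemp -d)
cat > "$workdir/demo.fastq" <<'EOF'
@SRR030834.1 HWI-EAS1:1:1:0:1 length=8
ACGTACGT
+
IIIIIIII
@SRR030834.2 HWI-EAS1:1:1:0:2 length=8
TTGACCAA
+
IIIIIIII
EOF

# Line 1 of each 4-line record is the header: keep its first word and
# replace the leading "@" with ">".
# Line 2 of each record is the sequence: print it unchanged.
awk 'NR % 4 == 1 {print ">" substr($1, 2)}
     NR % 4 == 2 {print}' "$workdir/demo.fastq" > "$workdir/demo.fasta"
cat "$workdir/demo.fasta"
```

Because awk selects lines by their position within each record, this version does not depend on how many space-separated parts the header contains.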

In the FASTA format, as shown in Figure 1.9, each entry contains a definition line and a

sequence. The defline begins with “>” and can contain an identifier immediately after “>”

(no whitespace in between).
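The defline rule stated above gives a quick way to sanity-check a FASTA file. The following is a minimal sketch: the file "demo.fasta" and its contents are hypothetical, and the check only covers the two properties mentioned in the text (a defline starts with ">" and the identifier follows with no whitespace).

```shell
# Hypothetical FASTA file used only for this sketch.
workdir=$(mktemp -d)
printf '>SRR030834.1\nACGTACGT\n>SRR030834.2\nTTGACCAA\n' > "$workdir/demo.fasta"

# Count entries: every entry has exactly one defline starting with ">".
n_entries=$(grep -c '^>' "$workdir/demo.fasta")

# Count deflines where ">" is followed by whitespace or by nothing at all.
# grep exits non-zero when nothing matches, so "|| true" keeps the script going.
n_bad=$(grep -c -E '^>([[:space:]]|$)' "$workdir/demo.fasta" || true)

echo "entries: $n_entries, malformed deflines: $n_bad"
```

On a real file, the entry count should equal the number of reads in the source FASTQ file, which is its line count divided by four.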

FIGURE 1.8 Extracting IDs and sequence of a FASTQ file.

FIGURE 1.9 Extracting FASTA sequence from the FASTQ file.